# Load the libraries we will need here
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(ggplot2)
library(plyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x purrr::lift()      masks caret::lift()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(ggcorrplot)
library(knitr)
library(splines)
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.0-2
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(grid)
library(RColorBrewer)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(gbm)
## Loaded gbm 2.1.8
library(arules) # for Association Rules analysis
## 
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
library(tree)
## Registered S3 method overwritten by 'tree':
##   method     from
##   print.tree cli
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(cluster)
library(arulesViz)

nb.cols <- 18
mycolors <- colorRampPalette(brewer.pal(8, "Blues"))(nb.cols)
cookiecol<-c('#ad6a1d','#9a5327','#cc8d4a','#4e1703','#ecc78d','#2e0a05','#d0dfe4','#33575b', '#173742')

1 Background

The main dataset is the transaction record of a bakery (https://www.kaggle.com/sulmansarwar/transactions-from-a-bakery). Though not specified in the description of the dataset, it is implied that this bakery is located in the old town of Edinburgh, UK. The dataset was downloaded from Kaggle and is stored under the same path as this R Markdown file.

To supplement the dataset with more features, we extract historical weather records from Meteostat (https://dev.meteostat.net/python/hourly.html#response-parameters). We use the weather data recorded by a weather station at Edinburgh Airport, which is about 8 miles from the Edinburgh old town. This dataset is also downloaded and stored under the same path as this R Markdown file.

2 Questions To Be Answered

Combining the bakery transaction record and the historical weather record, there are a few questions we could explore further, such as:

  1. Knowing the date, hour, and weather forecast, can we predict the number of transactions during that specific hour?

  2. Which items tend to be purchased together?

  3. Are there any clusters among the transactions?

3 Data Cleaning and Wrangling

3.1 Read The Dataset

First we will read the bakery transaction record and the hourly weather record for Edinburgh.

df_ori <- read.csv(file = 'BreadBasket_DMS.csv',
               header = TRUE,
               encoding = 'utf-8')

df_weather_ori <- read.csv(file = 'Edinburgh_weather_hourly.csv',
                       header = TRUE,
                       encoding = 'utf-8')

Then we display the first few rows of each dataset.

head(df_ori)
##         Date     Time Transaction          Item
## 1 2016-10-30 09:58:11           1         Bread
## 2 2016-10-30 10:05:34           2  Scandinavian
## 3 2016-10-30 10:05:34           2  Scandinavian
## 4 2016-10-30 10:07:57           3 Hot chocolate
## 5 2016-10-30 10:07:57           3           Jam
## 6 2016-10-30 10:07:57           3       Cookies
head(df_weather_ori)
##            time temp dwpt rhum prcp snow wdir wspd wpgt pres tsun coco station
## 1 1/1/2016 0:00    2  0.1   87   NA   NA   NA   NA   NA 1013   NA   NA    3160
## 2 1/1/2016 1:00    2  0.1   87   NA   NA  240 22.3   NA 1014   NA   NA    3160
## 3 1/1/2016 2:00    2  0.1   87   NA   NA  230 25.9   NA 1015   NA   NA    3160
## 4 1/1/2016 3:00    3 -1.0   75   NA   NA  230 20.5   NA 1015   NA   NA    3160
## 5 1/1/2016 4:00    2 -2.0   75   NA   NA  240  9.4   NA 1016   NA   NA    3160
## 6 1/1/2016 5:00    2 -2.0   75   NA   NA   NA  3.6   NA 1016   NA   NA    3160

We will create matching keys so that the two datasets can be merged.

df <- df_ori
df_weather <- df_weather_ori
df$Date <- as.Date(df$Date, "%Y-%m-%d")
df$Hour <- as.numeric(substr(df$Time, 1, 2))
df$key <- paste(as.character(df$Date), "@", as.character(df$Hour))
df_weather <- df_weather %>% separate(time, c("Date", "Hour_Minute"), " ")
df_weather <- df_weather %>% separate(Hour_Minute, c("Hour", "Minute"), ":")
df_weather$Date <- as.Date(df_weather$Date, "%m/%d/%Y")
df_weather$Hour <- as.numeric(df_weather$Hour)
df_weather$key <- paste(as.character(df_weather$Date), "@", as.character(df_weather$Hour))

3.2 The Bakery Transaction Record

3.2.1 NA Values and Duplicated Rows

Now we will deal with the data types and missing values, if any, starting with the bakery transaction record.

summary(df)
##       Date                  Time        Transaction        Item     
##  Min.   :2016-10-30   12:07:39:   16   Min.   :   1   Coffee :5471  
##  1st Qu.:2016-12-03   10:45:21:   13   1st Qu.:2548   Bread  :3325  
##  Median :2017-01-21   10:55:19:   13   Median :5067   Tea    :1435  
##  Mean   :2017-01-17   14:38:01:   13   Mean   :4952   Cake   :1025  
##  3rd Qu.:2017-02-28   13:43:08:   12   3rd Qu.:7329   Pastry : 856  
##  Max.   :2017-04-09   14:19:47:   12   Max.   :9684   NONE   : 786  
##                       (Other) :21214                  (Other):8395  
##       Hour           key           
##  Min.   : 1.00   Length:21293      
##  1st Qu.:10.00   Class :character  
##  Median :12.00   Mode  :character  
##  Mean   :12.25                     
##  3rd Qu.:14.00                     
##  Max.   :23.00                     
## 

There are no NAs. Next, we will verify whether there are duplicated rows in the dataset.

dim(df[duplicated(df), ])[1]
## [1] 1653

There are duplicated rows. Since our analysis will not focus on the quantities of items sold, it is safe to remove them.

df <- distinct(df)

3.2.2 Weekday vs. Weekend

Given the nature of the bakery business, we may expect different behaviors during weekdays and weekends.

df$Weekday <- weekdays(df$Date, abbreviate = TRUE)

df$Weekday <- factor(df$Weekday, 
                        levels = c('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'))


res <- ddply(df, ~Weekday, summarise, No_of_transaction = length(unique(Transaction)))

ggplot(data = res, mapping = aes(x = Weekday, y = No_of_transaction, fill = Weekday)) +
  geom_bar(stat = 'identity',width=0.7,alpha=0.8) +
  labs(title = 'No. of Transactions by Weekdays', x = 'Weekdays', y = 'Number of Transactions') +
  scale_fill_brewer(palette = "Blues") +
  theme_minimal()

As expected, there are more transactions on Saturday. However, the number of transactions on Sunday seems low. To understand why, we will look at the number of transactions by hour for each weekday.

res <- ddply(df, .(Weekday, Hour), summarise, No_of_transaction = length(unique(Transaction)))

ggplot(data = res, mapping = aes(x = as.factor(Hour), y = No_of_transaction, fill = as.factor(Hour))) +
  geom_bar(stat = "identity",alpha=0.8) + facet_wrap(~ Weekday) +
  labs(title = 'No. of Transactions by Hour by Weekdays', x = 'Hours', y = 'Number of Transactions') +
  scale_fill_manual(values = mycolors) +
  theme_minimal() + 
  theme(legend.position = "none",
axis.text = element_text(size=8),
)

Looking at the distribution of transactions by hour, we notice that the trends for Saturday and Sunday are similar to each other and different from those of the other days. Therefore, it makes sense to group Saturday and Sunday together, even though Sunday has fewer transactions than Saturday. This also implies that we may want to split the dataset into weekday and weekend subsets for further analysis.

df$Weekend <- mapvalues(df$Weekday,
                        from = c('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'),
                        to = c(0, 0, 0, 0, 0, 1, 1))

3.2.3 Hours of The Day

From the figure presented in 3.2.2, we noticed that a few transactions are associated with abnormal hours, such as 1 am and 11 pm.

res <- ddply(df, ~Hour, summarise, No_of_transaction = length(unique(Transaction)))
ggplot(data = res, mapping = aes(x = as.factor(Hour), y = No_of_transaction, fill = as.factor(Hour))) +
  geom_bar(stat = 'identity',alpha=0.8) +
  labs(title = 'No. of Transactions by Hour', x = 'Hours', y = 'Number of Transactions') +
  scale_fill_manual(values = mycolors) +
  theme_minimal() +
  theme(legend.position = "none")

Considering the small number of transactions at abnormal hours, we will drop the rows whose hour falls outside normal business operating times. In other words, we will drop rows whose hour is 1, 21, 22, or 23.

df <- df%>%filter(Hour<21&Hour>1)

Another observation is that the number of transactions within an hour varies by time of day. Here we split the day into two segments: rush hours from 9 to 15 and non-rush hours for the rest of the day. This split makes sense practically, as the period from 9 to 15 covers breakfast, lunch, and afternoon coffee or tea time.

df$Rush_hours <- mapvalues(df$Hour,
                        from = c(7, 8,
                                 9, 10, 11, 12, 13, 14, 15,
                                 16, 17, 18, 19, 20),
                        to = c(0, 0,
                               1, 1, 1, 1, 1, 1, 1,
                               0, 0, 0, 0, 0))
df$Rush_hours<-as.factor(df$Rush_hours)

3.2.4 Holidays

We also want to take a look at sales during holidays. We will focus on two holidays, Christmas and New Year, coded as: 0 = non-holiday; 1 = Christmas/New Year (12/24/2016 – 1/1/2017). The bakery did not open on 12/25/2016, 12/26/2016, or 1/1/2017.

df$Holiday<-0
df$Holiday[df$Date>'2016-12-23'&df$Date<'2017-01-02']<-1
hol<-df%>%group_by(Holiday)%>%summarize(mean_transaction = length(unique(Transaction))/length(unique(Date)))
## `summarise()` ungrouping output (override with `.groups` argument)
hol%>%ggplot(aes(as.factor(Holiday),mean_transaction, fill = as.factor(Holiday),label = round(mean_transaction,2))) +
  labs(title = 'Average No. of Transactions by Holiday', x = 'Holiday', y = 'Average Number of Transactions') +
  geom_bar(stat = 'identity', width = 0.5,alpha=0.8) +
  geom_text(vjust = -0.5) +
  scale_fill_manual(values = cookiecol[c(5,7)]) +
  theme_minimal()

df$Holiday<-as.factor(df$Holiday)

Indeed, the average number of transactions on holidays is about 12% lower than on non-holidays.
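The 12% figure is the relative difference of the two bars in the chart above. A minimal sketch of the calculation, using hypothetical group means in place of the real values in `hol$mean_transaction`:

```r
# Hypothetical daily-average transaction counts for illustration;
# the real values are the two bars of the chart above
mean_non_holiday <- 10.0
mean_holiday <- 8.8

# Relative drop of the holiday average vs. the non-holiday average, in percent
drop_pct <- round(100 * (mean_non_holiday - mean_holiday) / mean_non_holiday)
drop_pct  # 12
```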

3.2.5 List of Purchased Items

Then we will focus on the “Item” column to understand what it includes.

Item_tb <- table(df$Item)
sort(Item_tb, decreasing = TRUE)
## 
##                        Coffee                         Bread 
##                          4528                          3096 
##                           Tea                          Cake 
##                          1350                           983 
##                        Pastry                          NONE 
##                           815                           753 
##                      Sandwich                     Medialuna 
##                           680                           585 
##                 Hot chocolate                       Cookies 
##                           551                           515 
##                       Brownie                    Farm House 
##                           379                           371 
##                         Juice                        Muffin 
##                           364                           364 
##                     Alfajores                         Scone 
##                           344                           327 
##                          Soup                         Toast 
##                           326                           318 
##                  Scandinavian                      Truffles 
##                           274                           192 
##                          Coke                Spanish Brunch 
##                           184                           172 
##                      Baguette                        Tiffin 
##                           152                           146 
##                         Fudge                           Jam 
##                           142                           142 
##                 Mineral water                Jammie Dodgers 
##                           133                           125 
##                  Chicken Stew             Hearty & Seasonal 
##                           123                           100 
##                         Salad                      Frittata 
##                            99                            81 
##                     Smoothies              Keeping It Local 
##                            77                            63 
##                     The Nomad                      Focaccia 
##                            58                            54 
##                Vegan mincepie                      Bakewell 
##                            52                            48 
##                       Tartine      Afternoon with the baker 
##                            46                            43 
##                      Art Tray          Extra Salami or Feta 
##                            38                            38 
##                          Eggs                       Granola 
##                            28                            28 
##                        Tshirt              My-5 Fruit Shoot 
##                            21                            18 
##        Ella's Kitchen Pouches                        Crisps 
##                            17                            14 
##                Dulce de Leche                      Duck egg 
##                            13                            12 
##                  Kids biscuit            Pick and Mix Bowls 
##                            12                            12 
##              Christmas common                Mighty Protein 
##                            11                            11 
##                  Tacos/Fajita              Valentine's card 
##                            11                            11 
##                      Postcard                    Chocolates 
##                            10                             9 
##             Gingerbread syrup                   Vegan Feast 
##                             9                             9 
##    Drinking chocolate spoons                         Muesli 
##                             8                             8 
##                     Nomad bag               Argentina Night 
##                             8                             7 
##              Coffee granules                      Empanadas 
##                             7                             7 
##              Victorian Sponge                        Basket 
##                             7                             6 
##                        Crepes           Half slice Monster  
##                             6                             6 
##                         Honey             Lemon and coconut 
##                             6                             6 
##                       Pintxos                  Bare Popcorn 
##                             6                             5 
##                      Mortimer                      Panatone 
##                             5                             5 
##                 Bread Pudding            Brioche and salami 
##                             4                             3 
##                 Caramel bites         Cherry me Dried fruit 
##                             3                             3 
## Raspberry shortbread sandwich                 Bowl Nic Pitt 
##                             3                             2 
##               Chimichurri Oil                   Fairy Doors 
##                             2                             2 
##                Hack the stack                      Siblings 
##                             2                             2 
##                        Spread                    Adjustment 
##                             2                             1 
##                         Bacon                  Chicken sand 
##                             1                             1 
##                  Gift voucher                Olum & polenta 
##                             1                             1 
##                       Polenta                      Raw bars 
##                             1                             1 
##                      The BART 
##                             1

This list of items seems inconsistent and confusing. For example, there are items named “NONE”. Also, Brownie is separated from Cake, Baguette is not counted as Bread, and Medialuna is treated differently from Pastry.

The item types “Adjustment” and “NONE” were probably introduced by the transaction tracking system or the cashier; in other words, there is no real purchase behind them, so we will drop them from the dataset. Then the ‘Item_Type’ column is reorganized and coded as follows:

Bread = 1

Cookies = 2

Cake|Pastry|Sweets = 3

Coffee = 4

Tea = 5

Hot chocolate|Smoothie|Juice = 6

Other beverage = 7

Meal = 8

Other = 9

*Ambiguous items are coded as ‘Other’

df <- df %>% filter(Item!='NONE' & Item!='Adjustment')
df$Item_Type <- 9
df$Item_Type[df$Item%in%c('Bread', 'Farm House', 'Toast','Baguette','Focaccia')]<-1
df$Item_Type[df$Item == 'Cookies']<-2
df$Item_Type[df$Item%in%c('Cake','Pastry','Medialuna','Brownie','Muffin','Alfajores','Scone','Scandinavian','Truffles','Tiffin','Fudge','Jammie Dodgers','Bakewell','Tartine','Vegan mincepie')]<-3
df$Item_Type[df$Item == 'Coffee']<-4
df$Item_Type[df$Item == 'Tea']<-5
df$Item_Type[df$Item%in%c('Hot chocolate', 'Juice', 'Smoothies')]<-6
df$Item_Type[df$Item%in%c('Mineral water', 'Coke')]<-7
df$Item_Type[df$Item%in%c('Sandwich', 'Soup', 'Spanish Brunch', 'Chicken Stew', 'Salad','Frittata')]<-8
df$Item_Type<-as.factor(df$Item_Type)
df%>%group_by(Item_Type)%>%
  summarize(Count = n()) %>%
  ggplot(aes(x=Item_Type,y=Count,fill=Item_Type,label = Count)) +
  geom_bar(stat="identity", width=0.7,alpha=0.7) +
  theme_minimal() + 
  scale_fill_manual(values = cookiecol,labels = c("1:Bread", "2:Cookies", "3:Cake|Pastry|Sweets",'4:Coffee','5:Tea','6:Hot chocolate|Smoothie|Juice','7:Other beverage','8:Meal','9:Other')) +
  geom_text(vjust = -0.5,size=3) + 
  labs(title = 'Item Frequency in Unique Transactions', x = 'Item Type', y = 'Number of Transactions') 
## `summarise()` ungrouping output (override with `.groups` argument)

From the Figure above, cakes/pastries/sweets are the most popular items in the bakery, followed by Coffee and Bread.

We also want to see whether the sales of each item type vary by hour of the day. We will focus on bread, cookies, cakes/pastries/sweets, coffee, and tea.

bread<-df%>%filter(Item_Type==1)%>%
  group_by(Hour)%>%
  summarize(Avg_No_of_transaction = length(unique(Transaction))/length(unique(df$Transaction)), .groups = 'drop') %>%
  ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
  geom_bar(stat = 'identity',alpha=0.8) +
  labs(title = 'Average Number of Transactions of Bread by Hour', x = 'Hours', y = 'Avg No. of Transaction') +
  scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
  theme_minimal() +
  theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)


cookie<-df%>%filter(Item_Type==2)%>%
  group_by(Hour)%>%
  summarize(Avg_No_of_transaction = length(unique(Transaction))/length(unique(df$Transaction)), .groups = 'drop') %>%
  ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
  geom_bar(stat = 'identity',alpha=0.8) +
  labs(title = 'Average Number of Transactions of Cookies by Hour', x = 'Hours', y = 'Avg No. of Transaction') +
  scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
  theme_minimal() +
  theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)

pastry<-df%>%filter(Item_Type==3)%>%
  group_by(Hour)%>%
  summarize(Avg_No_of_transaction = length(unique(Transaction))/length(unique(df$Transaction)), .groups = 'drop') %>%
  ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
  geom_bar(stat = 'identity',alpha=0.8) +
  labs(title = 'Average Number of Transactions of Cakes/Pastries/Sweets by Hour', x = 'Hours', y = 'Avg No. of Transaction') +
  scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
  theme_minimal() +
  theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)

coffee<-df%>%filter(Item_Type==4)%>%
  group_by(Hour)%>%
  summarize(Avg_No_of_transaction = length(unique(Transaction))/length(unique(df$Transaction)), .groups = 'drop') %>%
  ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
  geom_bar(stat = 'identity',alpha=0.8) +
  labs(title = 'Average Number of Transactions of Coffee by Hour', x = 'Hours', y = 'Avg No. of Transaction') +
  scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
  theme_minimal() +
  theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)

tea<-df%>%filter(Item_Type==5)%>%
  group_by(Hour)%>%
  summarize(Avg_No_of_transaction = length(unique(Transaction))/length(unique(df$Transaction)), .groups = 'drop') %>%
  ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
  geom_bar(stat = 'identity',alpha=0.8) +
  labs(title = 'Average Number of Transactions of Tea by Hour', x = 'Hours', y = 'Avg No. of Transaction') +
  scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
  theme_minimal() +
  theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)

grid.arrange(bread,cookie,pastry,coffee,tea, nrow=3,ncol=2,
             top = textGrob("Average Transaction Frequency by Hours: Bread, Cookies, Pastries, Coffee, and Tea",gp=gpar(fontsize=14,font=3)))

The figure above shows that bread and coffee tend on average to sell more in the morning (until ~11 am), while tea tends to sell more in the afternoon. Transaction frequencies of cookies and cakes/pastries/sweets peak in both the morning and the afternoon.
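The five plotting chunks above differ only in the item-type filter and the title, so a small helper could generate each panel and cut the repetition. A sketch, assuming the same `df` with `Item_Type`, `Hour`, and `Transaction` columns as above (the function name `plot_item_by_hour` is our own):

```r
library(dplyr)
library(ggplot2)

# Build one hourly-frequency panel for a given item type code.
plot_item_by_hour <- function(df, type, title) {
  n_total <- length(unique(df$Transaction))  # denominator: all transactions
  df %>%
    filter(Item_Type == type) %>%
    group_by(Hour) %>%
    summarize(Avg_No_of_transaction = length(unique(Transaction)) / n_total,
              .groups = "drop") %>%
    ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction,
               fill = as.factor(Hour))) +
    geom_bar(stat = "identity", alpha = 0.8) +
    labs(title = title, x = "Hours", y = "Avg No. of Transaction") +
    theme_minimal() +
    theme(legend.position = "none", plot.title = element_text(size = 8))
}
```

With this helper, each panel becomes a one-liner, e.g. `bread <- plot_item_by_hour(df, 1, 'Average Number of Transactions of Bread by Hour')`.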

3.3 The Edinburgh Weather Record

Let’s start with a high-level summary of the dataset.

summary(df_weather)
##       Date                 Hour          Minute               temp       
##  Min.   :2016-01-01   Min.   : 0.00   Length:17544       Min.   :-8.000  
##  1st Qu.:2016-07-01   1st Qu.: 5.00   Class :character   1st Qu.: 5.000  
##  Median :2016-12-31   Median :11.00   Mode  :character   Median :10.000  
##  Mean   :2016-12-31   Mean   :11.50                      Mean   : 9.165  
##  3rd Qu.:2017-07-02   3rd Qu.:17.25                      3rd Qu.:13.000  
##  Max.   :2018-01-01   Max.   :23.00                      Max.   :28.000  
##                                                                          
##       dwpt             rhum          prcp           snow        
##  Min.   :-8.900   Min.   : 29.00   Mode:logical   Mode:logical  
##  1st Qu.: 2.100   1st Qu.: 76.00   NA's:17544     NA's:17544    
##  Median : 6.900   Median : 86.00                                
##  Mean   : 6.165   Mean   : 82.56                                
##  3rd Qu.:10.000   3rd Qu.: 93.00                                
##  Max.   :18.000   Max.   :100.00                                
##                                                                 
##       wdir            wspd         wpgt              pres        tsun        
##  Min.   : 10.0   Min.   : 0.00   Mode:logical   Min.   : 965   Mode:logical  
##  1st Qu.:130.0   1st Qu.: 7.60   NA's:17544     1st Qu.:1006   NA's:17544    
##  Median :240.0   Median :13.00                  Median :1014                 
##  Mean   :200.5   Mean   :15.04                  Mean   :1013                 
##  3rd Qu.:250.0   3rd Qu.:20.50                  3rd Qu.:1020                 
##  Max.   :360.0   Max.   :66.60                  Max.   :1040                 
##  NA's   :1140    NA's   :28                     NA's   :824                  
##       coco           station         key           
##  Min.   : 5.000   Min.   :3160   Length:17544      
##  1st Qu.: 7.000   1st Qu.:3160   Class :character  
##  Median : 7.000   Median :3160   Mode  :character  
##  Mean   : 7.139   Mean   :3160                     
##  3rd Qu.: 7.000   3rd Qu.:3160                     
##  Max.   :25.000   Max.   :3160                     
##  NA's   :14714

There are NA values in some of the columns. The columns “prcp”, “snow”, “wpgt”, and “tsun” contain only NA values, so we will drop them.
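Rather than hard-coding their names, the all-NA columns could also be detected programmatically. A sketch on a toy data frame (the same expression would apply to `df_weather`):

```r
# Toy data frame with one all-NA column, mimicking the weather data
toy <- data.frame(temp = c(2, 3, 2),
                  prcp = c(NA, NA, NA),
                  wspd = c(22.3, NA, 9.4))

# Names of columns that contain nothing but NA
all_na_cols <- names(toy)[colSums(!is.na(toy)) == 0]
all_na_cols  # "prcp"

# Drop them
toy_clean <- toy[, !(names(toy) %in% all_na_cols), drop = FALSE]
```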

The website of Meteostat gives a clear explanation of each column (https://dev.meteostat.net/python/hourly.html#response-parameters):

  1. station: The Meteo ID of the weather station

  2. temp: The air temperature in C

  3. dwpt: The dew point in C

  4. rhum: The relative humidity in percent

  5. wdir: The average wind direction in degrees

  6. wspd: The average wind speed in km/h

  7. pres: The average sea-level air pressure in hPa

  8. coco: The weather condition code (https://dev.meteostat.net/docs/formats.html#weather-condition-codes). Only significant weather events are reported here.

Based on the description of each column, we can drop the “station” column, as it is the same for all observations. We can also drop the “dwpt” column, as we already include “temp” among our features. Another variable to drop is “pres”, as air pressure is difficult to interpret for people without a background in meteorology. For the missing values in “coco”, we will fill in 0, as it stands for a non-significant weather event.

We will also remove the “Minute” column, as it is always 00.

df_weather <- subset(df_weather,
                     select = -c(Minute, prcp, snow, wpgt, tsun, station, dwpt, pres))

For the missing values in “wdir” and “wspd”, given the continuity of weather conditions, we will fill each gap with the value of the following hour.

df_weather$coco[is.na(df_weather$coco)] <- 0

df_weather <- df_weather %>% tidyr::fill(wdir, .direction = "up")
df_weather <- df_weather %>% tidyr::fill(wspd, .direction = "up")
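In `tidyr::fill()`, `.direction = "up"` pulls each non-missing value upward into the NAs above it, which, with rows ordered by time, means every gap is filled with the next recorded hour's reading. A toy check:

```r
library(tidyr)

# Toy wind-speed column with gaps; rows are consecutive hours
toy <- data.frame(wspd = c(NA, 22.3, NA, NA, 9.4))

# Each NA is replaced by the next non-missing value below it
filled <- fill(toy, wspd, .direction = "up")
filled$wspd  # 22.3 22.3 9.4 9.4 9.4
```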

3.4 The Combined Dataset

We will now merge the two datasets on the pre-defined keys.

df_merged <- merge(x = df, y = df_weather, by = 'key', all.x = TRUE)
df_merged$Date <- df_merged$Date.x
df_merged$Hour <- df_merged$Hour.x
df_merged <- subset(df_merged,
                    select = -c(key, Time, Date.y, Hour.y, Date.x, Hour.x))
df_merged$coco<-as.factor(df_merged$coco)
df_merged_feature <- subset(df_merged, select = -c(Transaction, Item, Weekday,Weekend, Item_Type, Date,Holiday,coco,Rush_hours))
#Calculate correlation here
corr <- round(cor(df_merged_feature), digits = 2)

#Use ggcorrplot to graph correlation. Only plot the lower triangle of the correlation matrix.
ggcorrplot(corr, type = "lower", 
           ggtheme = ggplot2::theme_minimal,
           lab = TRUE,
           colors = c(cookiecol[5],'white',cookiecol[7]))

The correlation table suggests a possible correlation between temperature and humidity, and between temperature and wind speed.

3.5 Dataset Construction

3.5.1 Predict Transactions By Day of week, Hour, and Weather Information

# To build the dataset for the regression problem
# Keep Hour as a feature
df_merged_reg<-df_merged%>%
  group_by(Date, Hour)%>%
  mutate(No_of_transaction = length(unique(Transaction)))%>%
  ungroup()%>%
  #select(-c(Transaction, Item, Item_Type,Date,Hour))%>%
  select(-c(Transaction, Item, Item_Type,Date))%>%
  unique()%>%
  data.frame()

# Keep Weekday as a feature
df_weekday<-df_merged_reg%>%
  filter(Weekend==0)%>%
  #select(-Weekday,-Weekend)%>%
  select(-Weekend)%>%
  data.frame()

summary(df_weekday)
##  Weekday   Rush_hours Holiday       temp            rhum             wdir      
##  Mon:201   0:308      0:1065   Min.   :-6.00   Min.   : 47.00   Min.   : 10.0  
##  Tue:225   1:790      1:  33   1st Qu.: 5.00   1st Qu.: 71.25   1st Qu.:200.0  
##  Wed:222                       Median : 7.00   Median : 81.00   Median :240.0  
##  Thu:228                       Mean   : 6.85   Mean   : 80.01   Mean   :212.5  
##  Fri:222                       3rd Qu.: 9.00   3rd Qu.: 87.00   3rd Qu.:260.0  
##  Sat:  0                       Max.   :15.00   Max.   :100.00   Max.   :360.0  
##  Sun:  0                                                                       
##       wspd       coco          Hour       No_of_transaction
##  Min.   : 0.00   0 :978   Min.   : 7.00   Min.   : 1.000   
##  1st Qu.: 9.40   5 : 20   1st Qu.:10.00   1st Qu.: 3.000   
##  Median :14.80   7 : 84   Median :12.00   Median : 5.000   
##  Mean   :16.84   8 :  2   Mean   :12.42   Mean   : 5.593   
##  3rd Qu.:24.10   12:  5   3rd Qu.:15.00   3rd Qu.: 8.000   
##  Max.   :55.40   17:  9   Max.   :20.00   Max.   :17.000   
## 
# The weekend subset
df_weekend<-df_merged_reg%>%
  filter(Weekend==1)%>%
  #select(-Weekday,-Weekend)%>%
  select(-Weekend)%>%
  data.frame()

summary(df_weekend)
##  Weekday   Rush_hours Holiday      temp             rhum             wdir      
##  Mon:  0   0: 83      0:378   Min.   :-2.000   Min.   : 31.00   Min.   : 10.0  
##  Tue:  0   1:313      1: 18   1st Qu.: 4.000   1st Qu.: 76.00   1st Qu.:220.0  
##  Wed:  0                      Median : 7.000   Median : 82.00   Median :240.0  
##  Thu:  0                      Mean   : 7.452   Mean   : 81.41   Mean   :218.7  
##  Fri:  0                      3rd Qu.:10.000   3rd Qu.: 93.00   3rd Qu.:260.0  
##  Sat:226                      Max.   :16.000   Max.   :100.00   Max.   :360.0  
##  Sun:170                                                                       
##       wspd       coco          Hour       No_of_transaction
##  Min.   : 0.00   0 :333   Min.   : 7.00   Min.   : 1.000   
##  1st Qu.: 9.40   5 : 10   1st Qu.:10.00   1st Qu.: 5.000   
##  Median :14.80   7 : 48   Median :12.00   Median : 8.000   
##  Mean   :16.44   8 :  0   Mean   :12.44   Mean   : 8.359   
##  3rd Qu.:22.30   12:  1   3rd Qu.:15.00   3rd Qu.:11.000   
##  Max.   :48.20   17:  4   Max.   :20.00   Max.   :25.000   
## 

Next, we draw scatter plots of the number of transactions against each weather feature.

# Scatter plots of number of transactions against each weather feature
temp<-df_merged_reg%>%ggplot(aes(temp,No_of_transaction)) + 
  geom_jitter(alpha = 0.7, size=0.7, colour = cookiecol[3]) +
  labs(x='Temperature in Celsius', y='Number of Transactions') + 
  theme_minimal()

ws<-df_merged_reg%>%ggplot(aes(wspd,No_of_transaction)) + 
  geom_jitter(alpha = 0.7, size=0.7, colour = cookiecol[3]) +
  labs(x='Wind Speed', y='Number of Transactions') + 
  theme_minimal()

hum<-df_merged_reg%>%ggplot(aes(rhum,No_of_transaction)) + 
  geom_jitter(alpha = 0.7, size=0.7, colour = cookiecol[3]) +
  labs(x='Humidity', y='Number of Transactions') + 
  theme_minimal()

wcc<-df_merged_reg%>%ggplot(aes(coco,No_of_transaction)) + 
  geom_jitter(alpha = 0.7, size=0.7, colour = cookiecol[3]) +
  labs(x='Weather Condition Code', y='Number of Transactions') + 
  theme_minimal()

grid.arrange(temp,ws,hum,wcc, nrow = 2)

3.5.2 Market Basket Analysis

Here we construct the dataset on which we will later perform market basket analysis.

df_merged_mba <- df_merged

# Map Item_Type codes back to readable item-type names

df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 1] <- 'Bread'
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 2] <- 'Cookies'
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 3] <- 'Cake|Pastry|Sweets'
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 4] <- 'Coffee'
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 5] <- 'Tea'
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 6] <- 'Hot chocolate|Smoothie|Juice'
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 7] <- 'Other beverage'
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 8] <- 'Meal'
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 9] <- 'Other'


df_merged_mba_item_type_temp <- subset(df_merged_mba,
                                       select = c(Transaction, Item_Type_Name, Weekend))

df_merged_mba_item_type_temp <- distinct(df_merged_mba_item_type_temp)

# Remove transactions that include only one item type
df_merged_mba_item_type_temp <-
  df_merged_mba_item_type_temp %>%
  group_by(Transaction) %>%
  mutate(freq = n()) %>%
  data.frame()

df_merged_mba_item_type_temp <- df_merged_mba_item_type_temp[df_merged_mba_item_type_temp$freq > 1, ]


# MBA on category of items
df_merged_mba_item_type <- df_merged_mba_item_type_temp %>%
  group_by(Transaction) %>%
  summarise(basket = as.vector(list(Item_Type_Name)))
## `summarise()` ungrouping output (override with `.groups` argument)
# MBA on items
df_merged_mba_item <- df_merged_mba %>%
  group_by(Transaction) %>%
  mutate(basket = as.vector(list(Item)))%>%ungroup()%>%data.frame()

mba_wkday<-df_merged_mba_item%>%
  filter(Weekend==0)%>%
  select(-Weekend)

mba_wkend<-df_merged_mba_item%>%
  filter(Weekend==1)%>%
  select(-Weekend)

3.5.3 Study The Clusters of Transactions

Here we add, for each transaction, the number of distinct items purchased (stored as No_Item_Type).

df_merged_cluster<-df_merged%>%group_by(Transaction)%>%
  mutate(No_Item_Type = length(unique(Item)))%>%
  ungroup()%>%
  select(-Transaction,-Item, -Item_Type, -Weekday,-Date)%>%
  unique()%>%
  data.frame()

df_merged_cluster[,c(2:10)]<-sapply(df_merged_cluster[,c(2:10)], as.numeric)

cluster_wkday<-df_merged_cluster%>%filter(Weekend==0)%>%select(-Weekend)%>%data.frame()
cluster_wkend<-df_merged_cluster%>%filter(Weekend==1)%>%select(-Weekend)%>%data.frame()


summary(cluster_wkday)
##    Rush_hours     Holiday          temp             rhum            wdir      
##  Min.   :1.0   Min.   :1.00   Min.   :-6.000   Min.   : 47.0   Min.   : 10.0  
##  1st Qu.:2.0   1st Qu.:1.00   1st Qu.: 5.000   1st Qu.: 71.0   1st Qu.:200.0  
##  Median :2.0   Median :1.00   Median : 7.000   Median : 81.0   Median :240.0  
##  Mean   :1.8   Mean   :1.03   Mean   : 6.943   Mean   : 79.6   Mean   :212.3  
##  3rd Qu.:2.0   3rd Qu.:1.00   3rd Qu.: 9.000   3rd Qu.: 87.0   3rd Qu.:260.0  
##  Max.   :2.0   Max.   :2.00   Max.   :15.000   Max.   :100.0   Max.   :360.0  
##       wspd            coco            Hour        No_Item_Type   
##  Min.   : 0.00   Min.   :1.000   Min.   : 7.00   Min.   : 1.000  
##  1st Qu.: 9.40   1st Qu.:1.000   1st Qu.:10.00   1st Qu.: 1.000  
##  Median :14.80   Median :1.000   Median :12.00   Median : 2.000  
##  Mean   :17.01   Mean   :1.234   Mean   :12.38   Mean   : 2.269  
##  3rd Qu.:24.10   3rd Qu.:1.000   3rd Qu.:14.00   3rd Qu.: 3.000  
##  Max.   :55.40   Max.   :6.000   Max.   :20.00   Max.   :10.000
summary(cluster_wkend)
##    Rush_hours       Holiday           temp             rhum       
##  Min.   :1.000   Min.   :1.000   Min.   :-2.000   Min.   : 31.00  
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.: 4.000   1st Qu.: 76.00  
##  Median :2.000   Median :1.000   Median : 7.000   Median : 82.00  
##  Mean   :1.859   Mean   :1.039   Mean   : 7.541   Mean   : 81.22  
##  3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:10.000   3rd Qu.: 93.00  
##  Max.   :2.000   Max.   :2.000   Max.   :16.000   Max.   :100.00  
##       wdir            wspd            coco            Hour        No_Item_Type
##  Min.   : 10.0   Min.   : 0.00   Min.   :1.000   Min.   : 7.00   Min.   :1.0  
##  1st Qu.:220.0   1st Qu.: 9.40   1st Qu.:1.000   1st Qu.:10.00   1st Qu.:1.0  
##  Median :240.0   Median :14.80   Median :1.000   Median :12.00   Median :2.0  
##  Mean   :219.6   Mean   :16.22   Mean   :1.335   Mean   :12.38   Mean   :2.6  
##  3rd Qu.:260.0   3rd Qu.:22.30   3rd Qu.:1.000   3rd Qu.:14.00   3rd Qu.:4.0  
##  Max.   :360.0   Max.   :48.20   Max.   :6.000   Max.   :20.00   Max.   :9.0

4 Modeling and Analysis

4.1 Predicting The Number of Transactions

The task of this analysis is to predict the number of transactions during a specific hour, given the date, the hour, and the weather forecast. More importantly, from the model created, we expect to further understand how sales are driven by external variables such as weather, day of the week, and time of day.

First, we split the dataset into a training set and a validation set.

set.seed(0)

train_wkday<-sample(1:nrow(df_weekday), round(0.8*dim(df_weekday)[1]))
train_wkend <- sample(1:nrow(df_weekend), round(0.8*dim(df_weekend)[1]))
num_features <- dim(df_weekday)[2]

4.1.1 Lasso Regression Model

We start with a lasso regression model.

Weekday

x_wkday <- model.matrix(No_of_transaction~.,df_weekday)[,-1]
y_wkday <- df_weekday$No_of_transaction

y.test_wkday <- y_wkday[-train_wkday]

grid=10^seq(10,-2, length =100)
lasso.mod=glmnet(x_wkday[train_wkday,],y_wkday[train_wkday],alpha=1,lambda=grid)
plot(lasso.mod)
## Warning in regularize.values(x, y, ties, missing(ties)): collapsing to unique
## 'x' values

cv.out=cv.glmnet(x_wkday[train_wkday,],y_wkday[train_wkday],alpha=1)
plot(cv.out)

Then we run the lasso model on the test set to see the RMSE.

bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod,s=bestlam,newx=x_wkday[-train_wkday,])
#RMSE
lasso_RMSE_wkday <- sqrt(mean((lasso.pred-y.test_wkday)^2))
lasso_RMSE_wkday
## [1] 2.880674
out <- glmnet(x_wkday,y_wkday,alpha=1,lambda=grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:8,]
lasso.coef
## (Intercept)  WeekdayTue  WeekdayWed  WeekdayThu  WeekdayFri  WeekdaySat 
##  3.65314614 -0.08972362 -0.38786168  0.00000000  1.04048780  0.00000000 
##  WeekdaySun Rush_hours1 
##  0.00000000  3.64261787
lasso.coef[lasso.coef!=0]
## (Intercept)  WeekdayTue  WeekdayWed  WeekdayFri Rush_hours1 
##  3.65314614 -0.08972362 -0.38786168  1.04048780  3.64261787
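Note that `lambda.min` is the CV-error-minimizing choice; a common, more conservative alternative is the one-standard-error rule. A minimal sketch, assuming the `cv.out` and `out` objects fitted above:

```r
# lambda.1se: the largest lambda whose CV error is within one standard
# error of the minimum; it typically yields a sparser model
lam_1se <- cv.out$lambda.1se
coef_1se <- predict(out, type = "coefficients", s = lam_1se)

# Show only the coefficients that survive the heavier shrinkage
coef_1se[coef_1se[, 1] != 0, ]
```

Whether `lambda.min` or `lambda.1se` is preferable depends on how much sparsity is valued over raw test RMSE.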

Weekend

x_wkend <- model.matrix(No_of_transaction~.,df_weekend)[,-1]
y_wkend <- df_weekend$No_of_transaction

y.test_wkend <- y_wkend[-train_wkend]

grid=10^seq(10,-2, length =100)
lasso.mod=glmnet(x_wkend[train_wkend,],y_wkend[train_wkend],alpha=1,lambda=grid)
plot(lasso.mod)
## Warning in regularize.values(x, y, ties, missing(ties)): collapsing to unique
## 'x' values

cv.out=cv.glmnet(x_wkend[train_wkend,],y_wkend[train_wkend],alpha=1)
plot(cv.out)

The test set RMSE of the model chosen by lasso is

bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod,s=bestlam,newx=x_wkend[-train_wkend,])
#RMSE
lasso_RMSE_wkend <- sqrt(mean((lasso.pred-y.test_wkend)^2))
lasso_RMSE_wkend
## [1] 4.063711
out <- glmnet(x_wkend,y_wkend,alpha=1,lambda=grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:8,]
lasso.coef
## (Intercept)  WeekdayTue  WeekdayWed  WeekdayThu  WeekdayFri  WeekdaySat 
##    3.491346    0.000000    0.000000    0.000000    0.000000    2.214838 
##  WeekdaySun Rush_hours1 
##    0.000000    5.117108
lasso.coef[lasso.coef!=0]
## (Intercept)  WeekdaySat Rush_hours1 
##    3.491346    2.214838    5.117108

4.1.2 Random Forest Model and GBM Model

Random Forest

Secondly, we try another model: random forest regression. We search over several sets of hyperparameters, namely ‘mtry’ and ‘ntree’, to find the best random forest regression model as a baseline.

Weekday

search_grid <- expand.grid(mtry = c(round(num_features/3/2),
                                    round(num_features/3),
                                    round(num_features/3*2)),
                           ntree = c(100, 500, 1000))

min_RMSE <- Inf
best_mtry <- 0
best_ntree <- 0

for (i in seq(1, nrow(search_grid))){
  rf_reg_wkday <- randomForest(No_of_transaction~.,
                           df_weekday[train_wkday, ],
                           mtry = search_grid[i, ]$mtry,
                           ntree = search_grid[i, ]$ntree,
                           importance = TRUE
                           )
  if (mean(rf_reg_wkday$mse) < min_RMSE) {
    min_RMSE <- mean(rf_reg_wkday$mse)
    best_mtry <- search_grid[i, ]$mtry
    best_ntree <- search_grid[i, ]$ntree
    
  }

  }

# Train the random forest model using the best 'mtry' and 'ntree'.
rf_reg_best_wkday <- randomForest(No_of_transaction~.,
                           df_weekday[train_wkday, ],
                           mtry = best_mtry,
                           ntree = best_ntree,
                           importance = TRUE
                           )

rf_reg_best_wkday
## 
## Call:
##  randomForest(formula = No_of_transaction ~ ., data = df_weekday[train_wkday,      ], mtry = best_mtry, ntree = best_ntree, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 7.123584
##                     % Var explained: 33.56

Then we train a boosted regression-tree model (GBM) to compare with the random forest model constructed above.

Weekday

search_grid <- expand.grid(shrinkage = c(0.01, 0.001),
                           interaction.depth = c(4, 5),
                           n.minobsinnode = c(10, 100),
                           bag.fraction = c(0.5, 0.8),
                           optimal_trees = 0,
                           min_RMSE = 0)

for (i in seq(1, nrow(search_grid))){

  boost.fit_wkday <- gbm(No_of_transaction~. ,
                  data = df_weekday[train_wkday, ], 
                  distribution = "gaussian",
                  cv.folds = 10,
                  n.trees = 4000,
                  train.fraction = 0.75,
                  n.cores = NULL,
                  verbose = FALSE,
                  shrinkage = search_grid[i, ]$shrinkage,
                  interaction.depth = search_grid[i, ]$interaction.depth,
                  n.minobsinnode = search_grid[i, ]$n.minobsinnode,
                  bag.fraction = search_grid[i, ]$bag.fraction,
                  )
  search_grid[i, ]$min_RMSE <- min(boost.fit_wkday$cv.error)
  search_grid[i, ]$optimal_trees <- which.min(boost.fit_wkday$cv.error)

}

# Refit the model with the best set of hyperparameters:

boost_best_para_wkday <- 
  search_grid[search_grid$min_RMSE == min(search_grid$min_RMSE),]

boost.best_wkday <- gbm(No_of_transaction~. ,
                  data = df_weekday[train_wkday, ], 
                  distribution = "gaussian",
                  cv.folds = 10,
                  n.trees = 4000,
                  train.fraction = 0.75,
                  n.cores = NULL,
                  verbose = FALSE,
                  shrinkage = boost_best_para_wkday$shrinkage,
                  interaction.depth = boost_best_para_wkday$interaction.depth,
                  n.minobsinnode = boost_best_para_wkday$n.minobsinnode,
                  bag.fraction = boost_best_para_wkday$bag.fraction,
                  )

boost.best_wkday
## gbm(formula = No_of_transaction ~ ., distribution = "gaussian", 
##     data = df_weekday[train_wkday, ], n.trees = 4000, interaction.depth = boost_best_para_wkday$interaction.depth, 
##     n.minobsinnode = boost_best_para_wkday$n.minobsinnode, shrinkage = boost_best_para_wkday$shrinkage, 
##     bag.fraction = boost_best_para_wkday$bag.fraction, train.fraction = 0.75, 
##     cv.folds = 10, verbose = FALSE, n.cores = NULL)
## A gradient boosted model with gaussian loss function.
## 4000 iterations were performed.
## The best cross-validation iteration was 335.
## The best test-set iteration was 1078.
## There were 9 predictors of which 9 had non-zero influence.

Then we compare the performance of both the random forest model and the trees model using boosting algorithm on the Weekday validation set.

boost_pred_test_wkday <- predict(boost.best_wkday,
                           newdata = df_weekday[-train_wkday, ],
                           n.trees = 4000)
boost_RMSE_test_wkday <- caret::RMSE(boost_pred_test_wkday, df_weekday[-train_wkday, ]$No_of_transaction)

rf_pred_test_wkday <- predict(rf_reg_best_wkday,
                        newdata = df_weekday[-train_wkday, ])

rf_RMSE_test_wkday <- caret::RMSE(rf_pred_test_wkday, df_weekday[-train_wkday, ]$No_of_transaction)

print(paste0('RMSE of the boosting model on Weekday testing set:', round(boost_RMSE_test_wkday, 4)))
## [1] "RMSE of the boosting model on Weekday testing set:2.9753"
print(paste0('RMSE of the random forest model on Weekday testing set:', round(rf_RMSE_test_wkday, 4)))
## [1] "RMSE of the random forest model on Weekday testing set:2.8462"
print(paste0('RMSE of the lasso regression model on Weekday testing set:', round(lasso_RMSE_wkday, 4)))
## [1] "RMSE of the lasso regression model on Weekday testing set:2.8807"

For the weekday data, the random forest model performs best on the test set in terms of RMSE, so we will use it to explain how different features affect the number of transactions in a given weekday hour. Its RMSE on the training set is 2.6818, which does not suggest overfitting. One thing worth noting, however, is that our random forest explains only about 34% of the variance. This is probably due to the very limited number of features in our original dataset.
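The RMSE figures throughout this section are all computed the same way; as a self-contained illustration with made-up hourly counts (the training-set value of 2.6818 comes from applying this to the fitted forest's predictions on `df_weekday[train_wkday, ]`):

```r
# Root-mean-squared error, the metric used to compare models here
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# Tiny worked example: predicted vs. observed transaction counts
pred <- c(4, 6, 9, 12)
obs  <- c(5, 5, 10, 10)
rmse(pred, obs)   # sqrt((1 + 1 + 1 + 4) / 4) = sqrt(1.75), about 1.3229
```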

Weekend

search_grid <- expand.grid(mtry = c(round(num_features/3/2),
                                    round(num_features/3),
                                    round(num_features/3*2)),
                           ntree = c(100, 500, 1000))

min_RMSE <- Inf
best_mtry <- 0
best_ntree <- 0

for (i in seq(1, nrow(search_grid))){
  rf_reg_wkend <- randomForest(No_of_transaction~.,
                           df_weekend[train_wkend, ],
                           mtry = search_grid[i, ]$mtry,
                           ntree = search_grid[i, ]$ntree,
                           importance = TRUE
                           )
  if (mean(rf_reg_wkend$mse) < min_RMSE) {
    min_RMSE <- mean(rf_reg_wkend$mse)
    best_mtry <- search_grid[i, ]$mtry
    best_ntree <- search_grid[i, ]$ntree
    
  }

  }

# Train the random forest model using the best 'mtry' and 'ntree'.
rf_reg_best_wkend <- randomForest(No_of_transaction~.,
                           df_weekend[train_wkend, ],
                           mtry = best_mtry,
                           ntree = best_ntree,
                           importance = TRUE
                           )

rf_reg_best_wkend
## 
## Call:
##  randomForest(formula = No_of_transaction ~ ., data = df_weekend[train_wkend,      ], mtry = best_mtry, ntree = best_ntree, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 12.72002
##                     % Var explained: 41.38
search_grid <- expand.grid(shrinkage = c(0.01, 0.001),
                           interaction.depth = c(4, 5),
                           n.minobsinnode = c(10, 20),
                           bag.fraction = c(0.5, 0.8),
                           optimal_trees = 0,
                           min_RMSE = 0)

for (i in seq(1, nrow(search_grid))){

  boost.fit_wkend <- gbm(No_of_transaction~. ,
                  data = df_weekend[train_wkend, ], 
                  distribution = "gaussian",
                  cv.folds = 10,
                  n.trees = 4000,
                  train.fraction = 0.75,
                  n.cores = NULL,
                  verbose = FALSE,
                  shrinkage = search_grid[i, ]$shrinkage,
                  interaction.depth = search_grid[i, ]$interaction.depth,
                  n.minobsinnode = search_grid[i, ]$n.minobsinnode,
                  bag.fraction = search_grid[i, ]$bag.fraction,
                  )
  search_grid[i, ]$min_RMSE <- min(boost.fit_wkend$cv.error)
  search_grid[i, ]$optimal_trees <- which.min(boost.fit_wkend$cv.error)

}

# Refit the model with the best set of hyperparameters:

boost_best_para_wkend <- 
  search_grid[search_grid$min_RMSE == min(search_grid$min_RMSE),]

boost.best_wkend <- gbm(No_of_transaction~. ,
                  data = df_weekend[train_wkend, ], 
                  distribution = "gaussian",
                  cv.folds = 10,
                  n.trees = 4000,
                  train.fraction = 0.75,
                  n.cores = NULL,
                  verbose = FALSE,
                  shrinkage = boost_best_para_wkend$shrinkage,
                  interaction.depth = boost_best_para_wkend$interaction.depth,
                  n.minobsinnode = boost_best_para_wkend$n.minobsinnode,
                  bag.fraction = boost_best_para_wkend$bag.fraction,
                  )

boost.best_wkend
## gbm(formula = No_of_transaction ~ ., distribution = "gaussian", 
##     data = df_weekend[train_wkend, ], n.trees = 4000, interaction.depth = boost_best_para_wkend$interaction.depth, 
##     n.minobsinnode = boost_best_para_wkend$n.minobsinnode, shrinkage = boost_best_para_wkend$shrinkage, 
##     bag.fraction = boost_best_para_wkend$bag.fraction, train.fraction = 0.75, 
##     cv.folds = 10, verbose = FALSE, n.cores = NULL)
## A gradient boosted model with gaussian loss function.
## 4000 iterations were performed.
## The best cross-validation iteration was 432.
## The best test-set iteration was 265.
## There were 9 predictors of which 8 had non-zero influence.
boost_pred_test_wkend <- predict(boost.best_wkend,
                           newdata = df_weekend[-train_wkend, ],
                           n.trees = 4000)
boost_RMSE_test_wkend <- caret::RMSE(boost_pred_test_wkend, df_weekend[-train_wkend, ]$No_of_transaction)

rf_pred_test_wkend <- predict(rf_reg_best_wkend,
                        newdata = df_weekend[-train_wkend, ])

rf_RMSE_test_wkend <- caret::RMSE(rf_pred_test_wkend, df_weekend[-train_wkend, ]$No_of_transaction)

print(paste0('RMSE of the boosting model on Weekend testing set:', round(boost_RMSE_test_wkend, 4)))
## [1] "RMSE of the boosting model on Weekend testing set:3.6731"
print(paste0('RMSE of the random forest model on Weekend testing set:', round(rf_RMSE_test_wkend, 4)))
## [1] "RMSE of the random forest model on Weekend testing set:3.5549"
print(paste0('RMSE of the lasso regression model on Weekend testing set:', round(lasso_RMSE_wkend, 4)))
## [1] "RMSE of the lasso regression model on Weekend testing set:4.0637"

As with the weekday data, the random forest model performs better on the weekend data than the other models considered. Its RMSE on the training set is 3.5981, which does not suggest overfitting. However, only about 41% of the variance is explained by the selected random forest model, again probably because of the very limited number of features in our original dataset.
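One caveat on the boosting comparisons above: the predictions use all 4000 trees, even though the best cross-validation iterations reported were far smaller (335 on weekdays, 432 on weekends). A hedged refinement, assuming the fitted `boost.best_wkend` object from above, is to let `gbm.perf()` pick the CV-optimal number of trees before predicting:

```r
# Pick the iteration that minimizes cross-validation error
best_iter_wkend <- gbm.perf(boost.best_wkend, method = "cv", plot.it = FALSE)

# Predict with the CV-optimal number of trees instead of all 4000
boost_pred_opt <- predict(boost.best_wkend,
                          newdata = df_weekend[-train_wkend, ],
                          n.trees = best_iter_wkend)
caret::RMSE(boost_pred_opt, df_weekend[-train_wkend, ]$No_of_transaction)
```

This may shrink the gap between the boosting and random forest RMSEs, since predicting past the optimal iteration can hurt test performance.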

4.1.3 Impact of Features

Now that we have the regression model constructed, we would like to further understand the impact of different features.

Weekday

varImpPlot(rf_reg_best_wkday)

The most important factors are the hour when the transaction happened and whether it falls within rush hours; this is not surprising. What is interesting is that the weather-related variables follow right after. We use partial dependence plots to further examine their impact.

Weekday

par(mfrow = c(2, 2))
partialPlot(rf_reg_best_wkday, df_weekday[train_wkday, ], x.var = 'wdir')
partialPlot(rf_reg_best_wkday, df_weekday[train_wkday, ], x.var = 'wspd')
partialPlot(rf_reg_best_wkday, df_weekday[train_wkday, ], x.var = 'rhum')
partialPlot(rf_reg_best_wkday, df_weekday[train_wkday, ], x.var = 'temp')

From the partial dependence plot for the weekday data, we can learn that:

  1. Temperature seems to have a U-shaped effect on the number of transactions: below roughly 2 degrees Celsius, the number of transactions decreases as temperature rises; above roughly 2 degrees, it increases with temperature. This is potentially because on weekdays transactions pick up during the morning rush hour, when temperatures tend to be low.

  2. When the wind blows from the west through north (wind direction 270-350 degrees), sales are higher than when it blows from other directions.

  3. Average humidity seems to have a positive effect on the number of transactions; however, when humidity approaches or exceeds 90, which suggests rainy weather, there is a sharp drop, indicating a negative effect.

  4. Stronger wind has a negative impact on sales.

To sum up, the most important variable in forecasting the number of transactions at a given time is whether it falls within rush hours. Different weather-related variables also have different impacts on the number of transactions, so a practical suggestion to the bakery is to watch the weather forecast and prepare accordingly.

Weekend

varImpPlot(rf_reg_best_wkend)

As in the weekday case, the hour when a transaction happened and whether it is during rush hours are the most important features; the weather-related variables also contribute to the model.

Weekend

par(mfrow = c(2, 2))
partialPlot(rf_reg_best_wkend, df_weekend[train_wkend, ], x.var = 'wdir')
partialPlot(rf_reg_best_wkend, df_weekend[train_wkend, ], x.var = 'wspd')
partialPlot(rf_reg_best_wkend, df_weekend[train_wkend, ], x.var = 'rhum')
partialPlot(rf_reg_best_wkend, df_weekend[train_wkend, ], x.var = 'temp')

From the partial dependence plot for the weekend data, we can learn that:

  1. Interestingly, in contrast to the weekday data, temperature no longer has a U-shaped effect; instead, the number of transactions generally increases with temperature. That is probably because few people go out to buy breakfast on a cold Saturday or Sunday morning.

  2. On weekends, when the wind blows from the west through north (wind direction 270-350 degrees), there is a large increase in the number of transactions.

  3. On weekends, the number of transactions decreases overall as humidity increases, and it drops sharply once humidity exceeds 40. The difference from weekdays is likely that few people go out in poor weather on a weekend.

  4. Just as on weekdays, wind speed has a negative effect on the number of transactions.

Compared with the weekday case, humidity and temperature affect the number of transactions differently. People have to go to work on weekdays regardless of the weather, but on weekends the weather plays a larger role in whether people decide to visit the bakery.

4.2 Market Basket Analysis

The objective of this analysis is to identify possible association rules. For example, if we find in the data that people who buy cake tend to buy cookies at the same time, the bakery could consider placing these two items closer together in the shop, or creating bundle offers, in order to boost sales.

We have already re-grouped the individual items, so we start by looking at the association rules among item types.

transactions_Item_Type <- as(df_merged_mba_item_type$basket, "transactions")
#inspect(transactions_Item_Type[1])

# Set the threshold low to identify more rules
# Set minlen = 2 to make sure there are items in LHS
rules_Item_Type <- apriori(transactions_Item_Type, 
                 parameter = list(support = 0.05, 
                                  confidence = 0.025,
                                  minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##       0.025    0.1    1 none FALSE            TRUE       5    0.05      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 246 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[9 item(s), 4928 transaction(s)] done [0.00s].
## sorting and recoding items ... [8 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [27 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# Remove redundant rules
rules_Item_Type <- rules_Item_Type[!is.redundant(rules_Item_Type)]

# Reorder by Lift and display
rules_Item_Type_dt <- data.table( lhs = labels( lhs(rules_Item_Type) ), 
                        rhs = labels( rhs(rules_Item_Type) ), 
                        quality(rules_Item_Type) )[ order(-lift), ]
head(rules_Item_Type_dt, 10)
##                                lhs                            rhs    support
##  1:                        {Bread}                       {Coffee} 0.25000000
##  2:                       {Coffee}                        {Bread} 0.25000000
##  3:           {Cake|Pastry|Sweets}                       {Coffee} 0.27597403
##  4:                       {Coffee}           {Cake|Pastry|Sweets} 0.27597403
##  5:                        {Bread}           {Cake|Pastry|Sweets} 0.22970779
##  6:           {Cake|Pastry|Sweets}                        {Bread} 0.22970779
##  7:                         {Meal}                       {Coffee} 0.09517045
##  8:                       {Coffee}                         {Meal} 0.09517045
##  9: {Hot chocolate|Smoothie|Juice}           {Cake|Pastry|Sweets} 0.06513799
## 10:           {Cake|Pastry|Sweets} {Hot chocolate|Smoothie|Juice} 0.06513799
##     confidence  coverage      lift count
##  1:  0.5116279 0.4886364 0.8739349  1232
##  2:  0.4270364 0.5854302 0.8739349  1232
##  3:  0.5087916 0.5424107 0.8690902  1360
##  4:  0.4714038 0.5854302 0.8690902  1360
##  5:  0.4700997 0.4886364 0.8666858  1132
##  6:  0.4234942 0.5424107 0.8666858  1132
##  7:  0.4885417 0.1948052 0.8345003   469
##  8:  0.1625650 0.5854302 0.8345003   469
##  9:  0.4439834 0.1467127 0.8185373   321
## 10:  0.1200898 0.5424107 0.8185373   321

From the table we observe that the highest lift among all rules is still below 1. That is, no meaningful association rules were identified among the item types.
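A lift below 1 means two itemsets co-occur less often than independence would predict. As a sanity check on the metric itself, lift is confidence divided by the support of the RHS itemset, so it can be recomputed from the table's own columns; a minimal base-R sketch using the first rule above, {Bread} => {Coffee}:

```r
# Values read off the rules table above
supp_rule <- 0.25        # support of {Bread, Coffee}
cov_lhs   <- 0.4886364   # coverage, i.e. support of {Bread}
supp_rhs  <- 0.5854302   # support of {Coffee} (coverage of the reversed rule)

conf_rule <- supp_rule / cov_lhs   # confidence = P(Coffee | Bread)
lift_rule <- conf_rule / supp_rhs  # lift < 1: bought together LESS than chance
round(lift_rule, 4)                # about 0.8739, matching the table
```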

We then focus the market basket analysis on the individual items rather than the categories. Here we analyze weekdays and weekends separately, since we expect different purchase behaviors.

4.2.1 Weekday Dataset Analysis and Discussion

Weekday

transactions_Item <- as(mba_wkday$basket, "transactions")
#inspect(transactions_Item[1])

# Set the threshold low to identify more rules
# Set minlen = 2 to make sure there are items in LHS

rules_Item <- apriori(transactions_Item, 
                 parameter = list(support = 0.005, 
                                  confidence = 0.25,
                                  minlen = 2)
                 )
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.005      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 59 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[85 item(s), 11825 transaction(s)] done [0.00s].
## sorting and recoding items ... [39 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [115 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# Remove redundant rules
rules_Item <- rules_Item[!is.redundant(rules_Item)]

# Reorder by Lift and display
rules_Item_dt_wkday <- data.table( lhs = labels( lhs(rules_Item) ), 
                        rhs = labels( rhs(rules_Item) ), 
                        quality(rules_Item) )[ order(-lift), ]

kable(head(rules_Item_dt_wkday, 20))
lhs rhs support confidence coverage lift count
{Coffee,Coke} {Sandwich} 0.0059197 0.4347826 0.0136152 3.897880 70
{Juice,Tea} {Cookies} 0.0053277 0.3519553 0.0151374 3.735971 63
{Coke} {Sandwich} 0.0109937 0.3735632 0.0294292 3.349041 130
{Mineral water} {Sandwich} 0.0085412 0.3519164 0.0242706 3.154974 101
{Coffee,Truffles} {Sandwich} 0.0050740 0.3260870 0.0155603 2.923410 60
{Coffee,Juice} {Cookies} 0.0105708 0.2642706 0.0400000 2.805207 125
{Cake,Soup} {Tea} 0.0055814 0.5689655 0.0098097 2.794027 66
{Coffee,Soup} {Sandwich} 0.0085412 0.2869318 0.0297674 2.572380 101
{Bread,Soup} {Tea} 0.0054123 0.3950617 0.0136998 1.940035 64
{Soup,Tea} {Cake} 0.0055814 0.2738589 0.0203805 1.935674 66
{Coffee,Juice} {Cake} 0.0103171 0.2579281 0.0400000 1.823072 122
{Spanish Brunch} {Tea} 0.0076956 0.3625498 0.0212262 1.780379 91
{Chicken Stew} {Tea} 0.0078647 0.3419118 0.0230021 1.679031 93
{Scone} {Tea} 0.0098943 0.3371758 0.0293446 1.655774 117
{Soup} {Tea} 0.0203805 0.3370629 0.0604651 1.655220 241
{Cookies,Juice} {Tea} 0.0053277 0.3150000 0.0169133 1.546875 63
{Extra Salami or Feta} {Coffee} 0.0054968 0.8552632 0.0064271 1.524263 65
{Tiffin} {Tea} 0.0064271 0.3102041 0.0207188 1.523324 76
{Keeping It Local} {Coffee} 0.0085412 0.8145161 0.0104863 1.451643 101
{Cake} {Tea} 0.0401691 0.2839211 0.1414799 1.394255 475
subrules2 <- head(sort(rules_Item, by = "lift"), 20)
# Plot the top-20 rules as a graph; current arulesViz no longer accepts
# control = list(type = "items"), so we drop it to avoid the warning
ig <- plot(subrules2, method = "graph")

A few observations can be made from the table:

During weekdays

  1. People who buy drinks, such as coffee, juice, tea, and mineral water, tend to also buy a sandwich

  2. People who buy coffee and juice, or juice and tea, tend to buy cookies. A customer would not normally buy two drinks just for him- or herself, so we suspect these are customers buying drinks and cookies to share with friends.

  3. People who buy lunch items, such as cake and soup, bread and soup, Spanish brunch, chicken stew, or soup alone, tend to buy tea at the same time.

  4. People who buy sweets, such as scones and cakes, tend to buy tea at the same time.

To further explore finding #4, we plot the distribution of transactions including scone or cake by hour.

p1<-mba_wkday%>%filter(Item=='Scone')%>%
  group_by(as.factor(Hour))%>%
  mutate(Avg_No_of_transaction = length(unique(Transaction))/length(unique(mba_wkday$Transaction))) %>%
  ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
  geom_bar(stat = 'identity',alpha=0.8) +
  labs(title = 'Average Number of Transactions of Scone by Hour',
       x = 'Hours', y = 'Avg No. of Transaction') +
  scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
  theme_minimal() +
  theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)

p2<-mba_wkday%>%filter(Item=='Cake')%>%
  group_by(as.factor(Hour))%>%
  mutate(Avg_No_of_transaction = length(unique(Transaction))/length(unique(mba_wkday$Transaction))) %>%
  ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
  geom_bar(stat = 'identity',alpha=0.8) +
  labs(title = 'Average Number of Transactions of Cake by Hour',
       x = 'Hours', y = 'Avg No. of Transaction') +
  scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
  theme_minimal() +
  theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)


grid.arrange(p1, p2,
             nrow=1,ncol=2,
             top = textGrob("Average Transaction Frequency by Hours",
                            gp=gpar(fontsize=14,font=3)))
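
The two chunks above differ only in the item being plotted, and the same pattern repeats for the weekend panels below, so the logic could be factored into a helper. A hedged sketch (`hourly_share_plot` is a hypothetical helper; it assumes the same data frames and the `nb.cols` palette size defined earlier):

```r
# Helper: share of transactions containing `item`, by hour of the day
hourly_share_plot <- function(data, item, nb.cols) {
  total_txn <- length(unique(data$Transaction))
  data %>%
    filter(Item == item) %>%
    group_by(Hour) %>%
    summarise(share = length(unique(Transaction)) / total_txn,
              .groups = "drop") %>%
    ggplot(aes(x = as.factor(Hour), y = share, fill = as.factor(Hour))) +
    geom_col(alpha = 0.8) +
    labs(title = paste('Average Number of Transactions of', item, 'by Hour'),
         x = 'Hours', y = 'Avg No. of Transaction') +
    scale_fill_manual(values = colorRampPalette(c('#ecc78d', '#2e0a05'))(nb.cols)) +
    theme_minimal() +
    theme(legend.position = "none",
          plot.title = element_text(size = 8),
          axis.title.x = element_text(size = 8),
          axis.title.y = element_text(size = 8))
}

# e.g. p1 <- hourly_share_plot(mba_wkday, 'Scone', nb.cols)
```

Using `summarise()` rather than `mutate()` also collapses the data to one row per hour, which avoids `geom_bar(stat = 'identity')` silently stacking duplicate rows for the same hour.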

It turns out that scones and cakes are sold more in the afternoon. The association rules between scone, cake, and tea therefore probably reflect an afternoon snack.

Based on the above, for weekdays we could recommend that the bakery:

  1. Create bundle deals for drinks and sandwiches

  2. Create bundle deals for multiple drinks and cookies or cakes

  3. Create bundle deals for lunch and tea

  4. Create bundle deals for afternoon snacks and tea

4.2.2 Weekend Dataset Analysis and Discussion

Weekend

transactions_Item <- as(mba_wkend$basket, "transactions")
#inspect(transactions_Item[1])

# Set the threshold low to identify more rules
# Set minlen = 2 to make sure there are items in LHS
rules_Item_wkend <- apriori(transactions_Item, 
                 parameter = list(support = 0.01, 
                                  confidence = 0.25,
                                  minlen = 2)
                 )
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 70 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[83 item(s), 7047 transaction(s)] done [0.00s].
## sorting and recoding items ... [34 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [84 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# Remove redundant rules
rules_Item_wkend <- rules_Item_wkend[!is.redundant(rules_Item_wkend)]

# Reorder by Lift and display
rules_Item_dt_wkend <- data.table( lhs = labels( lhs(rules_Item_wkend) ), 
                        rhs = labels( rhs(rules_Item_wkend) ), 
                        quality(rules_Item_wkend) )[ order(-lift), ]

kable(head(rules_Item_dt_wkend, 10))
lhs rhs support confidence coverage lift count
{Farm House} {Medialuna} 0.0102171 0.2599278 0.0393075 2.351362 72
{Jammie Dodgers} {Cake} 0.0103590 0.3273543 0.0316447 1.909657 73
{Coffee,Tea} {Cake} 0.0261104 0.2813456 0.0928054 1.641260 184
{Coffee,Hot chocolate} {Cake} 0.0184476 0.2771855 0.0665531 1.616992 130
{Bread,Tea} {Cake} 0.0129133 0.2684366 0.0481056 1.565954 91
{Cake} {Tea} 0.0488151 0.2847682 0.1714205 1.530711 344
{Tea} {Cake} 0.0488151 0.2623951 0.1860366 1.530711 344
{Hot chocolate} {Cake} 0.0278133 0.2525773 0.1101178 1.473437 196
{Scone} {Tea} 0.0224209 0.2585925 0.0867036 1.390008 158
{Hot chocolate,Pastry} {Coffee} 0.0100752 0.7553191 0.0133390 1.335692 71
# Graph for 10 rules - Weekend
subrules2 <- head(sort(rules_Item_wkend, by = "lift"), 10)
# Note: recent arulesViz versions no longer accept control = list(type = "items");
# the "graph" method plots items by default
ig <- plot(subrules2, method = "graph")

On weekends, the association rules concentrate on a few items: cake, tea, coffee, hot chocolate, and scone. One thing worth noting is that cake sits at the center of the graph; in other words, people tend to buy cake when they are purchasing a bunch of other items. The rules also suggest that people sometimes buy items to share with others, as there are transactions including both coffee and tea, or coffee and hot chocolate.

Since cake is at the center of the associations, we are curious when cake sales peak and how the sales of the drinks associated with cake vary by hour of the day.

p3<-mba_wkend%>%filter(Item=='Cake')%>%
  group_by(as.factor(Hour))%>%
  mutate(Avg_No_of_transaction = length(unique(Transaction))/length(unique(mba_wkend$Transaction))) %>%
  ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
  geom_bar(stat = 'identity',alpha=0.8) +
  labs(title = 'Average Number of Transactions of Cake by Hour',
       x = 'Hours', y = 'Avg No. of Transaction') +
  scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
  theme_minimal() +
  theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)

p4<-mba_wkend%>%filter(Item=='Coffee')%>%
  group_by(as.factor(Hour))%>%
  mutate(Avg_No_of_transaction = length(unique(Transaction))/length(unique(mba_wkend$Transaction))) %>%
  ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
  geom_bar(stat = 'identity',alpha=0.8) +
  labs(title = 'Average Number of Transactions of Coffee by Hour',
       x = 'Hours', y = 'Avg No. of Transaction') +
  scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
  theme_minimal() +
  theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)

p5<-mba_wkend%>%filter(Item=='Tea')%>%
  group_by(as.factor(Hour))%>%
  mutate(Avg_No_of_transaction = length(unique(Transaction))/length(unique(mba_wkend$Transaction))) %>%
  ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
  geom_bar(stat = 'identity',alpha=0.8) +
  labs(title = 'Average Number of Transactions of Tea by Hour',
       x = 'Hours', y = 'Avg No. of Transaction') +
  scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
  theme_minimal() +
  theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)

p6<-mba_wkend%>%filter(Item=='Hot chocolate')%>%
  group_by(as.factor(Hour))%>%
  mutate(Avg_No_of_transaction = length(unique(Transaction))/length(unique(mba_wkend$Transaction))) %>%
  ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
  geom_bar(stat = 'identity',alpha=0.8) +
  labs(title = 'Average Number of Transactions of Hot chocolate by Hour',
       x = 'Hours', y = 'Avg No. of Transaction') +
  scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
  theme_minimal() +
  theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)

grid.arrange(p3,p4, p5,p6,
             nrow=2,ncol=2,
             top = textGrob("Average Transaction Frequency by Hours",
                            gp=gpar(fontsize=14,font=3)))

From the figures above, coffee and hot chocolate are sold more in the morning, as opposed to tea, while sales of cake peak in the afternoon. It is likely that on weekends people gather at the bakery for cake and multiple drinks. The bakery could offer a special weekend afternoon combo of cake and drinks.

Based on these observations, our recommendations to the bakery are:

  1. During weekends, display cakes in a visible spot close to the cashier

  2. Offer a combo of multiple drinks and cake

4.3 Clustering Analysis and Discussion

In this section, we explore whether the transactions form any distinct clusters.

4.3.1 Clustering on Weekday Dataset

Weekday

fviz_nbclust(
  cluster_wkday,
  kmeans,
  k.max = 10,
  method = "wss"
)

fviz_nbclust(
  cluster_wkday,
  kmeans,
  k.max = 10,
  method = "silhouette"
)

Both figures suggest that the best k is 2, so we start with k = 2.

km_out_2 <- kmeans(cluster_wkday, 2, nstart = 25)

fviz_cluster(km_out_2, cluster_wkday, geom = 'point', ellipse.type = 'norm',
             main = 'Weekday Dataset: K-Means Clustering Results with K=2')

PCA may not be the ideal way to extract features for this visualization: the first and second principal components together explain only 37% of the variance.

To understand how well the transactions split into two clusters, we compare the cluster centers.

km_out_2$centers
##   Rush_hours  Holiday     temp     rhum      wdir     wspd     coco     Hour
## 1   1.818898 1.000000 5.645669 82.13858  74.94488 13.34630 1.188976 12.27874
## 2   1.794104 1.039002 7.316100 78.86395 251.90476 18.06277 1.246712 12.41224
##   No_Item_Type
## 1     2.305512
## 2     2.258957

The major difference between the two cluster centers is the wind direction. This is hard to interpret, and it is likely an artifact of clustering unscaled features: wind direction spans a much larger numeric range than the other variables and therefore dominates the distances. We therefore doubt whether the current clustering is meaningful.
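
One likely culprit is feature scaling: `wdir` spans roughly 0-360 while most other columns vary by only a few units, so it dominates the Euclidean distances that k-means minimizes. A hedged sketch of rerunning the clustering on standardized features (assuming `cluster_wkday` contains only numeric columns):

```r
# Standardize each column to mean 0, sd 1 before clustering,
# so no single feature dominates the Euclidean distances
km_out_2_scaled <- kmeans(scale(cluster_wkday), centers = 2, nstart = 25)

# Centers are now in standardized units; compare the per-feature separation
km_out_2_scaled$centers
```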

4.3.2 Clustering on Weekend Dataset

Weekend

fviz_nbclust(
  cluster_wkend,
  kmeans,
  k.max = 10,
  method = "wss"
)

fviz_nbclust(
  cluster_wkend,
  kmeans,
  k.max = 10,
  method = "silhouette"
)

Both figures suggest that the best k is 2, so we start with k = 2.

km_out_2 <- kmeans(cluster_wkend, 2, nstart = 25)

fviz_cluster(km_out_2, cluster_wkend, geom = 'point', ellipse.type = 'norm',
             main = 'Weekend Dataset: K-Means Clustering Results with K=2')

As with the weekday dataset, PCA may not be ideal for this visualization: the first and second principal components together explain only 38% of the variance.

To understand how well the transactions split into two clusters, we compare the cluster centers.

km_out_2$centers
##   Rush_hours  Holiday     temp     rhum      wdir     wspd     coco     Hour
## 1   1.885845 1.000000 5.538813 84.05936  45.70776 14.36256 1.561644 12.21918
## 2   1.853953 1.046512 7.948837 80.64093 255.04186 16.59395 1.289302 12.41488
##   No_Item_Type
## 1     2.643836
## 2     2.590698

Again, the major difference between the cluster centers is the wind direction, which is hard to interpret and again suggests the unscaled features are driving the split. We therefore doubt whether the current clustering is meaningful.

To sum up, we were not able to find a meaningful clustering of the transactions. On one hand, we only tried k-means; on the other, the original dataset offers very limited features.

5 Conclusion & Next Steps

In a nutshell, using the bakery transaction data and weather records, we first explored how different features, especially the weather-related variables, influence the number of transactions. We then mined association rules from the transaction history, identified purchasing patterns, and provided recommendations for the bakery to further increase sales. Lastly, we tried to cluster the transactions into segments, but the results were not satisfactory.

The major next steps of this analysis are:

  1. When predicting the number of transactions, the selected random forest model could explain only 30% - 40% of the variance. We would need to either acquire more data beyond Oct 2016 - Apr 2017 or obtain new features; the current ones cover only date, time, item names, and weather. If we could obtain customer information, we could further improve the prediction model.

  2. The association rules we identified generally have low support. With more data, we might be able to further explore the possible rules.

  3. We would need additional features to reshape the clustering analysis. As a next step, we could also try other clustering methods, such as GMM or DPGMM.

  4. Whether a day is a public holiday is not significant in any of our analyses, partly because our dataset spans only six months. With data covering several years, we could further explore the impact of holidays on the bakery's sales.
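
To illustrate next step 3, a Gaussian mixture model could be fit with the mclust package. A minimal hedged sketch (assuming mclust is installed and `cluster_wkday` is the all-numeric data frame used in Section 4.3):

```r
library(mclust)  # Gaussian mixture models with BIC-based model selection

# Fit mixtures with 1-5 components on standardized features;
# Mclust picks the best covariance model and component count by BIC
gmm_fit <- Mclust(scale(cluster_wkday), G = 1:5)

summary(gmm_fit)               # selected model and mixing proportions
table(gmm_fit$classification)  # resulting cluster sizes
```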